Adding website ingestion! #325
Conversation
Walkthrough
The changes introduce several new classes and methods for web crawling and data processing.
Changes
Sequence Diagram(s)
sequenceDiagram
participant User
participant WebsiteETL
participant CrawleeClient
participant CustomIngestionPipeline
User->>WebsiteETL: extract(urls)
WebsiteETL->>CrawleeClient: crawl(urls)
CrawleeClient-->>WebsiteETL: crawled data
WebsiteETL->>WebsiteETL: transform(crawled data)
WebsiteETL->>CustomIngestionPipeline: load(transformed documents)
Possibly related PRs
Actionable comments posted: 14
🧹 Outside diff range and nitpick comments (13)
requirements.txt (1)
25-25: Consider documenting playwright system dependencies
The [playwright] extra requires additional system dependencies for browser automation. Consider adding a note in the repository's README.md about the system requirements for playwright, such as:
- Required browser dependencies
- Minimum system requirements
- Installation steps for different operating systems
This will help other developers set up the environment correctly.
dags/hivemind_etl_helpers/src/utils/modules/website.py (3)
5-5: Remove extra blank line.
There's an unnecessary extra blank line that can be removed.
11-30: Consider improving type hints and docstring.
While the documentation is good, there are a few improvements that could make it more maintainable and clearer:
- The complex return type could benefit from a type alias
- The docstring could include information about potential exceptions
- The docstring could clarify the expected format of the URLs
Consider applying these improvements:
+from typing import TypeAlias
+
+WebsiteCommunityData: TypeAlias = dict[str, str | list[str]]
+
 def get_learning_platforms(
     self,
-) -> list[dict[str, str | list[str]]]:
+) -> list[WebsiteCommunityData]:
     """
     Get all the website communities with their page titles.
+
+    Raises
+    ------
+    Exception
+        When unable to fetch platform metadata
+
+    Notes
+    -----
+    URLs should be fully qualified (including protocol) and accessible
31-33: Consider batch processing for performance optimization.
The current implementation queries platform metadata individually for each platform. Consider implementing batch processing if the underlying get_platform_metadata method supports it.
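As a rough illustration of the batching idea, the sketch below replaces per-platform lookups with a single hypothetical bulk call; get_platforms and get_platforms_metadata are invented helper names, not part of the current ModulesBase API.

```python
# Hedged sketch only: `modules.get_platforms` and `modules.get_platforms_metadata`
# are hypothetical bulk helpers, not existing project methods.
def get_learning_platforms_batched(modules) -> list[dict[str, str | list[str]]]:
    platforms = modules.get_platforms()  # hypothetical: all website platforms
    platform_ids = [platform["platform_id"] for platform in platforms]

    # one bulk query instead of len(platform_ids) separate round-trips
    metadata_by_id = modules.get_platforms_metadata(platform_ids, "resources")

    return [
        {
            "community_id": platform["community_id"],
            "urls": metadata_by_id.get(platform["platform_id"], []),
        }
        for platform in platforms
    ]
```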
dags/hivemind_etl_helpers/src/db/website/crawlee_client.py (1)
7-61: Consider architectural improvements for production
For production-grade web crawling, consider these architectural improvements:
- Add request queuing system (e.g., Redis) for distributed crawling
Implement a retry mechanism with exponential backoff (see the sketch after this list)
- Add metrics collection (success rate, timing, etc.)
- Consider using proxy rotation for large-scale crawling
Would you like me to provide implementation details for any of these suggestions?
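For the retry point above, a minimal sketch of an async retry helper with exponential backoff is shown below; the wrapped crawl call and the default limits are assumptions, not the project's code.

```python
# Minimal sketch, not the project's implementation: retry an async callable
# with exponential backoff (plus jitter) before giving up.
import asyncio
import random
from typing import Any, Awaitable, Callable


async def retry_with_backoff(
    func: Callable[[], Awaitable[Any]],
    max_attempts: int = 4,
    base_delay: float = 1.0,
) -> Any:
    for attempt in range(1, max_attempts + 1):
        try:
            return await func()
        except Exception:
            if attempt == max_attempts:
                raise
            # exponential backoff with a little jitter: ~1s, ~2s, ~4s, ...
            delay = base_delay * (2 ** (attempt - 1)) + random.uniform(0, 0.5)
            await asyncio.sleep(delay)


# usage sketch (assumed call): data = await retry_with_backoff(lambda: crawlee_client.crawl(urls))
```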
dags/hivemind_website_ingestion.py (1)
15-98: Consider architectural improvements
Both DAGs share similar structure and functionality. Consider these improvements:
Create a base DAG class or factory function to reduce code duplication (see the sketch after this list)
- Implement proper error handling and retries for the ETL tasks
- Add monitoring and alerting for task failures
Would you like help implementing these architectural improvements?
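A possible shape for the factory idea in the first bullet, sketched with Airflow's TaskFlow API (assuming Airflow 2.4+); the names build_ingestion_dag, etl_callable, and fetch_communities are illustrative, and the schedule and start date are placeholders.

```python
# Sketch only: one factory producing structurally identical ingestion DAGs.
from datetime import datetime

from airflow.decorators import dag, task


def build_ingestion_dag(dag_id: str, etl_callable):
    @dag(
        dag_id=dag_id,
        start_date=datetime(2024, 1, 1),
        schedule=None,
        catchup=False,
    )
    def _ingestion_dag():
        @task
        def fetch_communities() -> list[dict]:
            # placeholder: look up website communities/modules here
            return []

        @task
        def process_community(community: dict) -> None:
            etl_callable(community)

        # dynamic task mapping, mirroring the task.expand usage in this PR
        process_community.expand(community=fetch_communities())

    return _ingestion_dag()
```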
🧰 Tools
🪛 Ruff
36-36: Local variable links is assigned to but never used. Remove assignment to unused variable links (F841)
44-44: Undefined name selected_channels (F821)
45-45: Undefined name from_date (F821)
52-52: Undefined name start_discord_vectorstore (F821)
70-70: Undefined name ModulesDiscord (F821)
dags/hivemind_etl_helpers/src/utils/modules/modules_base.py (1)
Line range hint 101-115: Update docstring to match the method's current purpose
The docstring is inconsistent with the method's current functionality:
- The description still states "get the userid that belongs to a platform" but the method now handles generic metadata
The return type in the docstring uses "Any" but the method signature specifies str | dict | list
Apply this diff to fix the docstring:
-    get the userid that belongs to a platform
+    get metadata for a platform

     Parameters
     -----------
     platform_id : bson.ObjectId
         the platform id we need their owner user id
     metadata_name : str
         a specific field of metadata that we want

     Returns
     ---------
-    metadata_value : Any
+    metadata_value : str | dict | list
         the values that the metadata belongs to
dags/hivemind_etl_helpers/website_etl.py (6)
20-20: Fix typo in docstring
There's a typo in the docstring: "ingegration" should be "integration".
Apply this diff to correct the typo:
-    notion ingegration access token
+    notion integration access token
94-94: Declare _prepare_id as a @staticmethod
Since the method _prepare_id does not use self, it should be declared as a @staticmethod to indicate that it doesn't rely on instance variables. Apply this diff to declare the method as static:
+    @staticmethod
     def _prepare_id(data: str) -> str:
94-109: Remove unused method _prepare_id if not needed
The _prepare_id method is defined but not used anywhere in the class. If it's not required, consider removing it to keep the codebase clean.
98-107: Ensure consistent docstring formatting
The docstring sections use inconsistent underline lengths. For better readability, the number of dashes should match the length of the section titles.
Example correction:
""" hash the given data to prepare an id for the document Parameters - ----------- + ---------- data : str the string data to be hashed Returns - -------- + ------- hashed_data : str data hashed using sha256 algorithm """
86-87: Correct the type hint in the docstring
In the load method's docstring, the type hint for documents should be updated to match the method signature.
Apply this diff to correct the type hint:
     Parameters
     -------------
-    documents: list[llama_index.Document]
+    documents: list[Document]
         the llama-index documents to be ingested
     """
55-64: Ensure consistent use of type hints in docstrings
In the transform method's docstring, ensure that the type hints are consistent with the method signature.
Apply this diff to correct the return type in the docstring:
     Returns
     ---------
-    documents : list[llama_index.Document]
+    documents : list[Document]
         list of llama-index documents
     """
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (7)
- dags/hivemind_etl_helpers/src/db/website/crawlee_client.py (1 hunks)
- dags/hivemind_etl_helpers/src/utils/modules/__init__.py (1 hunks)
- dags/hivemind_etl_helpers/src/utils/modules/modules_base.py (2 hunks)
- dags/hivemind_etl_helpers/src/utils/modules/website.py (1 hunks)
- dags/hivemind_etl_helpers/website_etl.py (1 hunks)
- dags/hivemind_website_ingestion.py (1 hunks)
- requirements.txt (1 hunks)
🧰 Additional context used
🪛 Ruff
dags/hivemind_website_ingestion.py
36-36: Local variable links is assigned to but never used. Remove assignment to unused variable links (F841)
44-44: Undefined name selected_channels (F821)
45-45: Undefined name from_date (F821)
52-52: Undefined name start_discord_vectorstore (F821)
70-70: Undefined name ModulesDiscord (F821)
🔇 Additional comments (5)
dags/hivemind_etl_helpers/src/utils/modules/__init__.py (1)
8-8
: LGTM! The import follows the established pattern.
The new import maintains consistency with the existing import style and alphabetical ordering.
Let's verify the existence of the imported module and class:
✅ Verification successful
Import verified and implementation looks correct
The ModulesWebsite class exists in the website.py module and is properly implemented as a subclass of ModulesBase with appropriate website-specific functionality for fetching community data. The import in __init__.py is valid and consistent with the codebase.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Verify the existence of website.py and ModulesWebsite class
# Expected: Find the website.py file and confirm ModulesWebsite class definition
# Check if website.py exists in the same directory
fd --type f "website.py" dags/hivemind_etl_helpers/src/utils/modules/
# Verify ModulesWebsite class definition
ast-grep --pattern 'class ModulesWebsite' dags/hivemind_etl_helpers/src/utils/modules/website.py
Length of output: 223
Script:
#!/bin/bash
# Let's verify the actual implementation of ModulesWebsite class
ast-grep --pattern 'class ModulesWebsite {
$$$
}'
Length of output: 53
Script:
#!/bin/bash
# Let's try a different pattern for Python class definition
ast-grep --pattern 'class ModulesWebsite:
$$$'
Length of output: 50
Script:
#!/bin/bash
# Let's check the content of the website.py file directly
cat dags/hivemind_etl_helpers/src/utils/modules/website.py
Length of output: 2067
requirements.txt (1)
25-25
: Verify crawlee version and security implications
The specified version 0.3.8 is not the latest release, and being a 0.x version suggests potential API instability.
Let's verify the version and security status:
✅ Verification successful
Consider upgrading crawlee to the latest stable version
The current version 0.3.8 is outdated compared to the latest version 0.4.2b4 (released Nov 15, 2024). While no security vulnerabilities were found, using an older version might miss important fixes and improvements. However, since the latest version is still in beta (0.4.2b4), staying on 0.3.8 could be a conscious choice for stability.
- Consider upgrading to the latest stable version in the 0.3.x series if available
- Document the reason for pinning to 0.3.8 if this version is specifically required
Consider upgrading to crawlee 0.3.9
The pinned version 0.3.8 is not the latest stable release in the 0.3.x series. Version 0.3.9 is available and would provide the latest bug fixes while maintaining API compatibility. No security vulnerabilities were found for any version.
- Update the requirement to crawlee[playwright]==0.3.9
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Check latest version and release date
curl -s https://pypi.org/pypi/crawlee/json | jq -r '.releases | keys_unsorted[-1] as $latest | "Latest version: \($latest)\nRelease date: \(.[$latest][0].upload_time)"'
# Check for any security advisories
gh api graphql -f query='
{
securityVulnerabilities(first: 5, ecosystem: PIP, package: "crawlee") {
nodes {
advisory {
summary
severity
publishedAt
}
vulnerableVersionRange
firstPatchedVersion {
identifier
}
}
}
}'
Length of output: 569
Script:
#!/bin/bash
# Check all available versions to find latest stable in 0.3.x series
curl -s https://pypi.org/pypi/crawlee/json | jq -r '.releases | keys[]' | grep '^0\.3\.' | sort -V
Length of output: 513
dags/hivemind_etl_helpers/src/utils/modules/website.py (1)
6-9
: LGTM! Clean class structure with proper inheritance.
The class is well-structured with appropriate inheritance from ModulesBase
and proper initialization sequence in the constructor.
dags/hivemind_etl_helpers/src/utils/modules/modules_base.py (1)
101-101: Verify the impact of the return type change
The return type change from str | dict to str | dict | list looks good and aligns with the website ingestion requirements. However, let's verify that all callers can handle the new list type.
✅ Verification successful
Let me analyze the usage of get_platform_metadata
across the codebase to verify the impact of the return type change.
From the search results, I can see all the callers and how they handle the return value:
- In website.py: Used to get "resources" and stored in website_links
- In modules_base.py: Used to get "userId" which is then converted to ObjectId
- In github.py: Used to get "account" with explicit type checking for dict
- In mediawiki.py: Used to get "baseURL" and "path" with explicit type checking for str
Let me verify if there are any other usages with a different search pattern.
Return type change to str | dict | list is safe to proceed
All callers of get_platform_metadata are properly handling the return types:
- website.py: Accepts list type for "resources" metadata (the reason for this change)
- github.py: Has explicit type checking for dict type with isinstance(account, dict)
- mediawiki.py: Has explicit type checking for str type with isinstance(path, str) and isinstance(base_url, str)
- modules_base.py: Uses it for "userId" which is always expected to be str (converted to ObjectId)
No other usages were found that could be impacted by the addition of the list type.
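As a small illustration of the defensive handling described above, a caller could normalize the widened return value before use; normalize_resources below is a hypothetical helper, not existing project code.

```python
# Hypothetical helper: coerce the widened get_platform_metadata return value
# (str | dict | list) into a plain list of URLs, rejecting anything else.
def normalize_resources(metadata: str | dict | list) -> list[str]:
    if isinstance(metadata, list):
        return metadata
    if isinstance(metadata, str):
        return [metadata]
    raise TypeError(f"unexpected metadata type: {type(metadata).__name__}")
```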
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Find all calls to get_platform_metadata to verify they handle the new list return type
# Search for direct method calls
rg -A 5 "get_platform_metadata\("
# Search for variable assignments that might need type checking
ast-grep --pattern 'metadata = $_.get_platform_metadata($$$)'
Length of output: 3453
Script:
#!/bin/bash
# Search for any other variable assignments or usages of get_platform_metadata
rg -A 2 "= .*get_platform_metadata\("
rg -A 2 "get_platform_metadata\(.*\)\."
Length of output: 1728
dags/hivemind_etl_helpers/website_etl.py (1)
31-50: Verify that crawlee_client.crawl is asynchronous
Ensure that the crawl method in CrawleeClient is an asynchronous function since it's being awaited in the extract method.
Run the following script to confirm the crawl method is asynchronous:
✅ Verification successful
Let's search for the implementation of the CrawleeClient class to verify if the crawl method is asynchronous. If we don't find it directly, we can try other search patterns.
crawl method is correctly implemented as asynchronous
The crawl method in CrawleeClient is properly defined as an async function and uses await statements for asynchronous operations with the crawler, which matches the usage in the extract method.
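The same contract could also be asserted in a unit test instead of a shell search, for example with inspect.iscoroutinefunction; the import path below is an assumption based on the repository layout shown in this PR.

```python
# Sketch of an in-test assertion that `crawl` stays a coroutine function;
# the import path is assumed from the dags/ folder layout, not verified here.
import inspect

from hivemind_etl_helpers.src.db.website.crawlee_client import CrawleeClient


def test_crawl_is_coroutine_function():
    assert inspect.iscoroutinefunction(CrawleeClient.crawl)
```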
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Check if `crawl` method in `CrawleeClient` is asynchronous.
# Search for the definition of `crawl` method
fd --type f --name '*.py' | xargs grep -E 'async def crawl\('
Length of output: 314
Script:
#!/bin/bash
# Let's try a different approach using ripgrep to find the CrawleeClient class and its crawl method
rg -A 5 "class CrawleeClient"
Length of output: 584
Script:
#!/bin/bash
# Let's examine the full content of the crawlee_client.py file to find the crawl method
cat dags/hivemind_etl_helpers/src/db/website/crawlee_client.py
Length of output: 2120
to resolve a dependency conflict for python-dateutil
Actionable comments posted: 4
🧹 Outside diff range and nitpick comments (6)
dags/hivemind_etl_helpers/tests/unit/test_website_etl.py (6)
10-18: Consider removing environment variable loading from unit tests
Loading environment variables in unit tests can lead to non-deterministic behavior and make tests dependent on the environment. Consider mocking any required environment variables directly in the test setup for better isolation and reproducibility.
 def setUp(self):
     """
     Setup for the test cases. Initializes a WebsiteETL instance with mocked dependencies.
     """
-    load_dotenv()
     self.community_id = "test_community"
     self.website_etl = WebsiteETL(self.community_id)
     self.website_etl.crawlee_client = AsyncMock()
     self.website_etl.ingestion_pipeline = MagicMock()
20-38: Add error handling test cases
The test covers the happy path but should also verify error handling scenarios. Consider adding test cases for:
- Empty URL list
- Failed crawl attempts
- Invalid URLs
Example addition:
async def test_extract_empty_urls(self):
    """Test extract method with empty URL list."""
    urls = []
    extracted_data = await self.website_etl.extract(urls)
    self.assertEqual(extracted_data, [])
    self.website_etl.crawlee_client.crawl.assert_not_awaited()

async def test_extract_crawl_failure(self):
    """Test extract method when crawl fails."""
    urls = ["https://example.com"]
    self.website_etl.crawlee_client.crawl.side_effect = Exception("Crawl failed")
    with self.assertRaises(Exception):
        await self.website_etl.extract(urls)
39-64: Add edge case tests for transform method
The current test covers the basic transformation, but should also verify:
- Empty raw data
- Missing fields in the input data
- Special characters in text/title
Example addition:
def test_transform_empty_data(self):
    """Test transform method with empty data."""
    self.assertEqual(self.website_etl.transform([]), [])

def test_transform_missing_fields(self):
    """Test transform method with missing fields."""
    raw_data = [{"url": "https://example.com"}]  # missing inner_text and title
    documents = self.website_etl.transform(raw_data)
    self.assertEqual(documents[0].text, "")  # verify default handling
65-82: Enhance load method test coverage
Consider adding test cases for:
- Multiple documents
- Empty document list
- Pipeline failures
Example addition:
def test_load_multiple_documents(self):
    """Test load method with multiple documents."""
    documents = [
        Document(doc_id="doc1", text="text1"),
        Document(doc_id="doc2", text="text2")
    ]
    self.website_etl.load(documents)
    self.website_etl.ingestion_pipeline.run_pipeline.assert_called_once_with(
        docs=documents
    )

def test_load_pipeline_failure(self):
    """Test load method when pipeline fails."""
    self.website_etl.ingestion_pipeline.run_pipeline.side_effect = Exception("Pipeline failed")
    with self.assertRaises(Exception):
        self.website_etl.load([Document(doc_id="doc1", text="text1")])
83-92: Add input validation tests for _prepare_id method
The current test only verifies basic string hashing. Consider adding test cases for:
- Empty strings
- Non-string inputs
- Unicode characters
- Very long inputs
Example addition:
def test_prepare_id_edge_cases(self):
    """Test _prepare_id method with various edge cases."""
    # Empty string
    self.assertTrue(WebsiteETL._prepare_id(""))
    # Unicode
    unicode_text = "Hello 世界"
    self.assertTrue(WebsiteETL._prepare_id(unicode_text))
    # Long string
    long_text = "a" * 10000
    self.assertTrue(WebsiteETL._prepare_id(long_text))
    # Non-string input
    with self.assertRaises(AttributeError):
        WebsiteETL._prepare_id(None)
1-92: Consider improving test organization and documentation
While the test coverage is good for basic scenarios, consider these improvements:
- Group related test cases using nested TestCase classes
- Add docstring examples showing input/output for complex scenarios
- Consider using parameterized tests for testing multiple similar cases
- Add integration tests for the complete ETL pipeline
Example test organization:
class TestWebsiteETL(IsolatedAsyncioTestCase):
    class TestExtract(IsolatedAsyncioTestCase):
        # Extract-related tests

    class TestTransform(IsolatedAsyncioTestCase):
        # Transform-related tests

    class TestLoad(IsolatedAsyncioTestCase):
        # Load-related tests
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (3)
- dags/hivemind_etl_helpers/src/db/website/crawlee_client.py (1 hunks)
- dags/hivemind_etl_helpers/tests/unit/test_website_etl.py (1 hunks)
- dags/hivemind_etl_helpers/website_etl.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- dags/hivemind_etl_helpers/website_etl.py
🧰 Additional context used
🪛 Ruff
dags/hivemind_etl_helpers/src/db/website/crawlee_client.py
74-74: Within an except
clause, raise exceptions with raise ... from err
or raise ... from None
to distinguish them from errors in exception handling
(B904)
116-116: Within an except
clause, raise exceptions with raise ... from err
or raise ... from None
to distinguish them from errors in exception handling
(B904)
Actionable comments posted: 5
🧹 Outside diff range and nitpick comments (10)
dags/hivemind_etl_helpers/website_etl.py (4)
3-3: Remove unused import
The hashlib module is imported but never used in the code. Apply this diff to remove the unused import:
-import hashlib
🧰 Tools
🪛 Ruff
3-3: hashlib imported but unused. Remove unused import: hashlib (F401)
14-19: Enhance class and method documentation
The docstring could be more comprehensive by including:
- Class purpose and responsibilities
- Description of the collection_name parameter
- Examples of usage
Consider expanding the docstring:
""" +WebsiteETL handles the Extract, Transform, Load pipeline for website data. + Parameters ----------- community_id : str the community to save its data +collection_name : str + name of the collection in the vector database (defaults to "website") + +Example +------- +>>> etl = WebsiteETL(community_id="community123") +>>> documents = await etl.extract(urls=["https://example.com"]) +>>> transformed = etl.transform(documents) +>>> etl.load(transformed) """
29-49: Add URL validation
Consider validating the input URLs before crawling to ensure they are well-formed and accessible.
Consider adding validation:
 async def extract(
     self,
     urls: list[str],
 ) -> list[dict[str, Any]]:
+    # Validate URLs
+    from urllib.parse import urlparse
+    for url in urls:
+        result = urlparse(url)
+        if not all([result.scheme, result.netloc]):
+            raise ValueError(f"Invalid URL format: {url}")
+
     extracted_data = await self.crawlee_client.crawl(urls)
     return extracted_data
67-67: Consider using a more robust document ID strategy
Using the URL directly as the document ID might lead to collisions if the same URL is processed multiple times. Consider using a combination of URL and timestamp or a hash of the content.
Consider implementing a more robust ID strategy:
-doc_id = data["url"]
+from datetime import datetime
+doc_id = f"{data['url']}_{datetime.utcnow().isoformat()}"
dags/hivemind_etl_helpers/tests/unit/test_website_etl.py (6)
6-6: Remove unused import
The hashlib module is imported but never used in the code.
-from llama_index.core import Document
-import hashlib
+from llama_index.core import Document
🧰 Tools
🪛 Ruff
6-6: hashlib imported but unused. Remove unused import: hashlib (F401)
11-13: Enhance setUp docstring
The docstring could be more descriptive about the mocked components and their purposes.
- """ - Setup for the test cases. Initializes a WebsiteETL instance with mocked dependencies. - """ + """ + Setup for the test cases. + + Initializes a WebsiteETL instance with: + - A mocked crawlee_client for simulating web crawling + - A mocked ingestion_pipeline for testing data loading + """
20-38: Enhance test coverage for extract method
While the happy path is well tested, consider adding these test cases:
- Empty URL list
- Failed crawl scenario
- Malformed response data
Example implementation:
async def test_extract_empty_urls(self):
    """Test extract method with empty URL list."""
    urls = []
    extracted_data = await self.website_etl.extract(urls)
    self.assertEqual(extracted_data, [])
    self.website_etl.crawlee_client.crawl.assert_awaited_once_with(urls)

async def test_extract_crawl_failure(self):
    """Test extract method when crawl fails."""
    urls = ["https://example.com"]
    self.website_etl.crawlee_client.crawl.side_effect = Exception("Crawl failed")
    with self.assertRaises(Exception):
        await self.website_etl.extract(urls)
39-64: Add edge cases to transform method tests
Consider adding test cases for:
- Empty input data
- Missing optional fields
- Special characters in text/title
Example implementation:
def test_transform_edge_cases(self):
    """Test transform method with edge cases."""
    # Test empty input
    self.assertEqual(self.website_etl.transform([]), [])

    # Test missing optional fields
    raw_data = [{"url": "https://example.com", "inner_text": "text"}]  # missing title
    documents = self.website_etl.transform(raw_data)
    self.assertEqual(documents[0].metadata.get("title"), None)

    # Test special characters
    raw_data = [{
        "url": "https://example.com",
        "inner_text": "Text with 特殊字符",
        "title": "Title with 🚀"
    }]
    documents = self.website_etl.transform(raw_data)
    self.assertEqual(documents[0].text, "Text with 特殊字符")
65-81: Enhance load method test coverage
The current test covers the basic case, but consider adding:
- Error handling test for pipeline failures
- Test with empty document list
- Test with multiple documents
Example implementation:
def test_load_pipeline_failure(self):
    """Test load method when pipeline fails."""
    documents = [Document(doc_id="test", text="text")]
    self.website_etl.ingestion_pipeline.run_pipeline.side_effect = Exception("Pipeline failed")
    with self.assertRaises(Exception):
        self.website_etl.load(documents)

def test_load_empty_documents(self):
    """Test load method with empty document list."""
    self.website_etl.load([])
    self.website_etl.ingestion_pipeline.run_pipeline.assert_called_once_with(docs=[])

def test_load_multiple_documents(self):
    """Test load method with multiple documents."""
    documents = [
        Document(doc_id="1", text="text1"),
        Document(doc_id="2", text="text2")
    ]
    self.website_etl.load(documents)
    self.website_etl.ingestion_pipeline.run_pipeline.assert_called_once_with(docs=documents)
1-81: Consider implementing a test data factory
To improve test maintainability and reduce duplication, consider implementing a test data factory class that generates test documents and raw data. This would make the tests more maintainable and ensure consistency across test cases.
Example implementation:
class WebsiteTestDataFactory:
    @staticmethod
    def create_raw_data(url="https://example.com", text="Example text", title="Example"):
        return {
            "url": url,
            "inner_text": text,
            "title": title
        }

    @staticmethod
    def create_document(doc_id="https://example.com", text="Example text", title="Example"):
        return Document(
            doc_id=doc_id,
            text=text,
            metadata={"title": title, "url": doc_id}
        )
🧰 Tools
🪛 Ruff
6-6: hashlib imported but unused. Remove unused import: hashlib (F401)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (3)
- dags/hivemind_etl_helpers/tests/unit/test_website_etl.py (1 hunks)
- dags/hivemind_etl_helpers/website_etl.py (1 hunks)
- dags/hivemind_website_ingestion.py (1 hunks)
🧰 Additional context used
🪛 Ruff
dags/hivemind_etl_helpers/tests/unit/test_website_etl.py
6-6: hashlib
imported but unused
Remove unused import: hashlib
(F401)
dags/hivemind_etl_helpers/website_etl.py
3-3: hashlib
imported but unused
Remove unused import: hashlib
(F401)
🔇 Additional comments (1)
dags/hivemind_website_ingestion.py (1)
50-51: LGTM! DAG structure is correct.
The use of task.expand for parallel processing of communities is appropriate and the task dependencies are correctly implied.
Actionable comments posted: 5
🧹 Outside diff range and nitpick comments (6)
dags/hivemind_etl_helpers/website_etl.py (2)
9-27: Consider making collection_name configurable
The collection name "website" is hardcoded. Consider making it configurable through the constructor to improve flexibility and reusability.
 def __init__(
     self,
     community_id: str,
+    collection_name: str = "website",
 ) -> None:
     """
     Parameters
     -----------
     community_id : str
         the community to save its data
+    collection_name : str, optional
+        the name of the collection (default: "website")
     """
     self.community_id = community_id
-    collection_name = "website"

     # preparing the data extractor and ingestion pipelines
     self.crawlee_client = CrawleeClient()
     self.ingestion_pipeline = CustomIngestionPipeline(
         self.community_id, collection_name=collection_name
     )
49-78: Add validation for created documents
Consider validating the created documents to ensure they contain meaningful content before returning them.
 def transform(self, raw_data: list[dict[str, Any]]) -> list[Document]:
     """
     transform raw data to llama-index documents
     ...
     """
     documents: list[Document] = []
     for data in raw_data:
         doc_id = data["url"]
         doc = Document(
             doc_id=doc_id,
             text=data["inner_text"],
             metadata={
                 "title": data["title"],
                 "url": data["url"],
             },
         )
+        # Validate document content
+        if not doc.text.strip():
+            continue
         documents.append(doc)

+    if not documents:
+        raise ValueError("No valid documents created from the raw data")
     return documents
dags/hivemind_etl_helpers/tests/unit/test_website_etl.py (4)
10-18: Consider mocking environment variables instead of loading them in setUp
Loading real environment variables in setUp could lead to test isolation issues and make tests dependent on the environment. Consider mocking environment variables or setting up test-specific values.
 def setUp(self):
     """
     Setup for the test cases. Initializes a WebsiteETL instance with mocked dependencies.
     """
-    load_dotenv()
     self.community_id = "test_community"
     self.website_etl = WebsiteETL(self.community_id)
     self.website_etl.crawlee_client = AsyncMock()
     self.website_etl.ingestion_pipeline = MagicMock()
20-38: Enhance test coverage with error cases and data validation
While the happy path is well tested, consider adding:
- Error cases (e.g., network failures, invalid URLs)
- Edge cases (e.g., empty URL list, malformed responses)
- Schema validation for the mocked_data structure
Example additional test:
async def test_extract_with_network_error(self):
    """Test the extract method when crawling fails."""
    urls = ["https://example.com"]
    self.website_etl.crawlee_client.crawl.side_effect = Exception("Network error")
    with self.assertRaises(Exception):
        await self.website_etl.extract(urls)
39-64: Add test cases for edge cases in transform method
Consider adding tests for:
- Empty or missing fields in raw_data
- Special characters in text/title
- Multiple documents transformation
- Invalid URLs
Example additional test case:
def test_transform_with_missing_fields(self):
    """Test transform with incomplete data."""
    raw_data = [
        {
            "url": "https://example.com",
            # missing inner_text and title
        }
    ]
    documents = self.website_etl.transform(raw_data)
    self.assertEqual(documents[0].text, "")  # or expected fallback
65-81: Enhance load testing with error handling and scalability tests
Consider adding tests for:
- Pipeline errors or exceptions
- Large document sets to verify performance
- Document validation failures
Example additional test:
def test_load_with_pipeline_error(self):
    """Test load method when pipeline fails."""
    documents = [Document(doc_id="test", text="text")]
    self.website_etl.ingestion_pipeline.run_pipeline.side_effect = Exception("Pipeline error")
    with self.assertRaises(Exception):
        self.website_etl.load(documents)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (4)
- dags/hivemind_etl_helpers/src/db/website/crawlee_client.py (1 hunks)
- dags/hivemind_etl_helpers/tests/unit/test_website_etl.py (1 hunks)
- dags/hivemind_etl_helpers/website_etl.py (1 hunks)
- dags/hivemind_website_ingestion.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- dags/hivemind_website_ingestion.py
🧰 Additional context used
🪛 Ruff
dags/hivemind_etl_helpers/src/db/website/crawlee_client.py
75-75: Within an except
clause, raise exceptions with raise ... from err
or raise ... from None
to distinguish them from errors in exception handling
(B904)
117-117: Within an except
clause, raise exceptions with raise ... from err
or raise ... from None
to distinguish them from errors in exception handling
(B904)
🔇 Additional comments (3)
dags/hivemind_etl_helpers/website_etl.py (2)
1-8
: LGTM! Well-structured ETL implementation
The imports are appropriate and the class structure follows the ETL pattern with clear separation of concerns.
79-89
: Past review comment about error handling is still applicable
The previous review comment suggesting error handling for the pipeline execution is still valid and should be addressed.
dags/hivemind_etl_helpers/src/db/website/crawlee_client.py (1)
1-14
: LGTM! Clean imports and class definition.
The imports are well-organized, and the class is properly defined with type hints. Good use of defusedxml for secure XML parsing.
There was a new pre-release version that was causing errors
Actionable comments posted: 0
🧹 Outside diff range and nitpick comments (7)
dags/hivemind_etl_helpers/website_etl.py (5)
9-27: Add input validation for community_id
Consider validating the community_id parameter to ensure it's not empty or malformed.
 def __init__(
     self,
     community_id: str,
 ) -> None:
+    if not community_id or not isinstance(community_id, str):
+        raise ValueError("community_id must be a non-empty string")
     self.community_id = community_id
     collection_name = "website"
28-53: Add URL format validation
Consider validating the URL format before crawling to fail fast on malformed URLs.
+from urllib.parse import urlparse
+
 async def extract(
     self,
     urls: list[str],
 ) -> list[dict[str, Any]]:
     if not urls:
         raise ValueError("No URLs provided for crawling")
+
+    for url in urls:
+        try:
+            result = urlparse(url)
+            if not all([result.scheme, result.netloc]):
+                raise ValueError(f"Invalid URL format: {url}")
+        except Exception as e:
+            raise ValueError(f"Invalid URL: {url}. Error: {str(e)}")
+
     extracted_data = await self.crawlee_client.crawl(urls)
54-83: Add input validation for raw_data structure
Consider validating the structure of raw_data entries before processing to ensure required fields are present.
 def transform(self, raw_data: list[dict[str, Any]]) -> list[Document]:
+    required_fields = {"url", "inner_text", "title"}
+    for data in raw_data:
+        missing_fields = required_fields - set(data.keys())
+        if missing_fields:
+            raise ValueError(
+                f"Missing required fields {missing_fields} in data: {data}"
+            )
+
     documents: list[Document] = []
     for data in raw_data:
84-94: Add input validation for documents
Consider validating the documents list to ensure it's not empty before attempting to load.
 def load(self, documents: list[Document]) -> None:
+    if not documents:
+        raise ValueError("No documents provided for loading")
+
     # loading data into db
     self.ingestion_pipeline.run_pipeline(docs=documents)
28-53: Consider implementing rate limiting for web crawling
The extract method could benefit from rate limiting to avoid overwhelming target websites. Consider implementing rate limiting in the CrawleeClient or at this layer.
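One way the layer-level option could look is a small per-host limiter awaited before each request; this is a hedged sketch with made-up names, not part of the CrawleeClient API.

```python
# Sketch only: throttle requests per host before handing URLs to the crawler.
import asyncio
import time
from collections import defaultdict
from urllib.parse import urlparse


class PerHostRateLimiter:
    """Allow at most `max_per_second` requests per host."""

    def __init__(self, max_per_second: float = 2.0) -> None:
        self.min_interval = 1.0 / max_per_second
        self._last_call: dict[str, float] = defaultdict(float)
        self._locks: dict[str, asyncio.Lock] = defaultdict(asyncio.Lock)

    async def acquire(self, url: str) -> None:
        host = urlparse(url).netloc
        async with self._locks[host]:
            elapsed = time.monotonic() - self._last_call[host]
            if elapsed < self.min_interval:
                await asyncio.sleep(self.min_interval - elapsed)
            self._last_call[host] = time.monotonic()


# usage sketch: `await limiter.acquire(url)` before each request for that URL
```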
dags/hivemind_etl_helpers/src/db/website/crawlee_client.py (2)
21-23: Document configuration choices.
Consider adding a docstring or comment explaining why persistence and metadata writing are disabled. This will help future maintainers understand the rationale behind these configuration choices.
25-44: Add URL validation in the request handler.
Consider validating the URL format before processing to prevent potential issues with malformed URLs. This is especially important since the handler is processing both regular pages and sitemaps.
 async def request_handler(context: PlaywrightCrawlingContext) -> None:
     context.log.info(f"Processing {context.request.url} ...")
+    if not context.request.url or not context.request.url.startswith(("http://", "https://")):
+        context.log.warning(f"Skipping invalid URL: {context.request.url}")
+        return
     inner_text = await context.page.inner_text(selector="body")
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (3)
- dags/hivemind_etl_helpers/src/db/website/crawlee_client.py (1 hunks)
- dags/hivemind_etl_helpers/website_etl.py (1 hunks)
- dags/hivemind_website_ingestion.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- dags/hivemind_website_ingestion.py
🧰 Additional context used
📓 Learnings (1)
dags/hivemind_etl_helpers/src/db/website/crawlee_client.py (1)
Learnt from: amindadgar
PR: TogetherCrew/airflow-dags#325
File: dags/hivemind_etl_helpers/src/db/website/crawlee_client.py:15-24
Timestamp: 2024-11-19T06:57:30.667Z
Learning: In this project, the `PlaywrightCrawler` does not support the attributes `max_requests_per_second` and `respect_robots_txt`.
🪛 Ruff
dags/hivemind_etl_helpers/src/db/website/crawlee_client.py
74-74: Within an except
clause, raise exceptions with raise ... from err
or raise ... from None
to distinguish them from errors in exception handling
(B904)
116-116: Within an except
clause, raise exceptions with raise ... from err
or raise ... from None
to distinguish them from errors in exception handling
(B904)
🔇 Additional comments (4)
dags/hivemind_etl_helpers/website_etl.py (1)
1-8
: LGTM: Clean imports and class definition
The imports are appropriate and the class is well-defined following Python conventions.
dags/hivemind_etl_helpers/src/db/website/crawlee_client.py (3)
1-14
: LGTM: Clean imports and well-structured class definition.
The imports are appropriate, and type hints are used correctly. The class constructor parameters are well-documented with sensible defaults.
115-116
: 🛠️ Refactor suggestion
Add exception chaining for timeout error.
Preserve the exception context by using raise ... from e.
except asyncio.TimeoutError as e:
- raise TimeoutError("Crawl operation timed out")
+ raise TimeoutError("Crawl operation timed out") from e
Likely invalid or redundant comment.
🧰 Tools
🪛 Ruff
116-116: Within an except
clause, raise exceptions with raise ... from err
or raise ... from None
to distinguish them from errors in exception handling
(B904)
73-74
: 🛠️ Refactor suggestion
Add exception chaining for better error tracking.
When re-raising exceptions, use raise ... from e to preserve the original exception context.
except ET.ParseError as e:
- raise ValueError(f"Invalid sitemap XML: {str(e)}")
+ raise ValueError(f"Invalid sitemap XML: {str(e)}") from e
Likely invalid or redundant comment.
🧰 Tools
🪛 Ruff
74-74: Within an except
clause, raise exceptions with raise ... from err
or raise ... from None
to distinguish them from errors in exception handling
(B904)
Summary by CodeRabbit
Release Notes
New Features
- CrawleeClient for enhanced web crawling capabilities.
- ModulesWebsite class for managing website community data.
- WebsiteETL class for streamlined data extraction, transformation, and loading from URLs.
- website_ingestion_embedding DAG for processing website community data.
Bug Fixes
- get_platform_metadata method return type updated for improved flexibility.
Chores
- tc-analyzer-lib updated to version 1.4.13.
- New dependencies: crawlee[playwright]==0.3.8, defusedxml==0.7.1, and pydantic==2.9.2.
Tests
- Unit tests for the WebsiteETL class to ensure core functionalities are working as intended.